You are a Data Scientist for a tourism company named "Visit with us". The company's Policy Maker wants to establish a viable business model to expand the customer base.
A viable business model captures how the business currently operates and how those practices can be changed to benefit the tourism sector.
One way to expand the customer base is to introduce a new package offering.
Currently, the company offers 5 types of packages - Basic, Standard, Deluxe, Super Deluxe, and King. Looking at last year's data, we observed that only 18% of customers purchased one of these packages.
However, the marketing cost was quite high because customers were contacted at random, without using the available information.
The company is now planning to launch a new product, a Wellness Tourism Package. Wellness tourism is defined as travel that allows the traveler to maintain, enhance, or kick-start a healthy lifestyle, and supports or increases their sense of well-being.
This time, however, the company wants to harness the available data on existing and potential customers to make the marketing expenditure more efficient.
As the Data Scientist at "Visit with us", you have to analyze the customer data, provide recommendations to the Policy Maker and the Marketing Team, and build a model to predict which customers are likely to purchase the newly introduced package.
Key meaningful observations on the relationship between variables
Prepare the data for analysis - missing value treatment, outlier detection (treat if needed - why or why not), feature engineering, and preparing the data for modeling
Let's start by importing libraries we need.
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor,RandomForestRegressor, GradientBoostingRegressor, AdaBoostRegressor, StackingRegressor
from xgboost import XGBRegressor
from sklearn import metrics
from sklearn.model_selection import GridSearchCV, train_test_split
# Loading the dataset
# data = pd.read_excel("Tourism.xlsx", sheet_name="Tourism")
xls = pd.ExcelFile('Tourism.xlsx')
data_dictionary = pd.read_excel(xls, 'Data Dict')
data = pd.read_excel(xls, 'Tourism')
View the first 5 rows of the dataset.
data.head()
Check data types and number of non-null values for each column.
data.info()
data["PreferredLoginDevice"] = data["PreferredLoginDevice"].astype("category")
data["Occupation"] = data["Occupation"].astype("category")
data["Gender"] = data["Gender"].astype("category")
data["ProductPitched"] = data["ProductPitched"].astype("category")
data["MaritalStatus"] = data["MaritalStatus"].astype("category")
data["Designation"] = data["Designation"].astype("category")
Check for missing values in each column using the isna() method.
data.isna().sum()
data.median(numeric_only=True)
# Replace the missing values in each numeric column with that column's median.
# Note, we do not need to specify the column names below:
# fillna aligns the medians by column, so every numeric column is filled with its own median.
data = data.fillna(data.median(numeric_only=True))
data.isna().sum()
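A median fill only applies to numeric columns; if any categorical columns also contain missing values, a mode (most-frequent-value) fill is one common option. A minimal sketch on toy data (the column names and values here are illustrative, not from the dataset):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Age": [25.0, np.nan, 40.0, 33.0],           # numeric column with a gap
    "Gender": ["Male", "Female", None, "Male"],  # categorical column with a gap
})

# Numeric columns: fill with the column median
df = df.fillna(df.median(numeric_only=True))

# Categorical (object) columns: fill with the most frequent value
for col in df.select_dtypes(include="object").columns:
    df[col] = df[col].fillna(df[col].mode()[0])

print(df.isna().sum().sum())  # 0 missing values remain
```

Mode imputation keeps the original category levels intact, but it can inflate the majority class if many values are missing.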
data.dtypes
Summary of the dataset
# Summary of continuous columns
data.describe().T
To do - Identify insights, if any, from the distributions.
Number of unique values in each column
data.shape
data.nunique()
Number of observations in each category
cat_cols = ['ProdTaken', 'Age', 'PreferredLoginDevice', 'CityTier', 'DurationOfPitch', 'Occupation', 'Gender', 'NumberOfPersonVisited', 'NumberOfFollowups', 'ProductPitched', 'PreferredPropertyStar', 'MaritalStatus', 'NumberOfTrips', 'Passport', 'PitchSatisfactionScore', 'OwnCar', 'NumberOfChildrenVisited', 'Designation']
for column in cat_cols:
    print('--**' * 10)
    print(data[column].value_counts())
    print('**--' * 10)
Product taken flag count: (0 - 3968, 1 - 920)
Preferred login device of the customer in the last month: (Self Enquiry - 3444, Company Invited - 1419)
City tier: (1 - 3190, 2 - 198, 3 - 1500)
Occupation of the customer: (Salaried - 2368, Small Business - 2084, Large Business - 434, Free Lancer - 2)
Gender of the customer: (Male - 2916, Female - 1817, Fe Male - 155)
Total number of persons who came with the customer: (1 - 39, 2 - 1418, 3 - 2402, 4 - 1026, 5 - 3)
Total number of follow-ups done by the salesperson after the sales pitch: (1.0 - 176, 2.0 - 229, 3.0 - 1466, 4.0 - 2113, 5.0 - 768, 6.0 - 136)
Product pitched by the salesperson: (Basic - 1842, Deluxe - 1732, Standard - 742, Super Deluxe - 342, King - 230)
Preferred hotel property rating of the customer: (3.0 - 3019, 4.0 - 913, 5.0 - 956)
Marital status of the customer: (Married - 2340, Divorced - 950, Single - 916, Unmarried - 682)
Average number of trips per year by the customer: (1.0 - 620, 2.0 - 1464, 3.0 - 1219, 4.0 - 478, 5.0 - 458, 6.0 - 322, 7.0 - 218, 8.0 - 105, 19.0 - 1, 20.0 - 1, 21.0 - 1, 22.0 - 1)
Customer passport flag: (0 - 3466, 1 - 1422)
Sales pitch satisfaction score: (1 - 942, 2 - 586, 3 - 1478, 4 - 912, 5 - 970)
Customer owns a car flag: (1 - 3032, 0 - 1856)
Total number of children visiting with the customer: (0.0 - 1082, 1.0 - 2146, 2.0 - 1335, 3.0 - 325)
Designation of the customer in their current organization: (Executive - 1842, Manager - 1732, Senior Manager - 742, AVP - 342, VP - 230)
Age and DurationOfPitch are continuous variables, so we examine their distributions below rather than their value counts.
# Fix the 'Fe Male' data-entry error by merging it into 'Female'
data.loc[data.Gender == 'Fe Male', 'Gender'] = 'Female'
print(data['Gender'].value_counts())
Histogram and box plots for all features after the above observations
hist_cols = ['Age', 'CityTier', 'DurationOfPitch', 'Occupation', 'Gender',
             'NumberOfPersonVisited', 'NumberOfFollowups', 'ProductPitched',
             'PreferredPropertyStar', 'MaritalStatus', 'NumberOfTrips',
             'PitchSatisfactionScore', 'NumberOfChildrenVisited', 'Designation']
for column in hist_cols:
    sns.histplot(data[column])
    plt.show()
box_cols = ['Age', 'DurationOfPitch', 'NumberOfPersonVisited',
            'NumberOfFollowups', 'PreferredPropertyStar', 'NumberOfTrips']
for column in box_cols:
    sns.boxplot(x=data[column])
    plt.show()
#Top 5 highest values
data['Age'].nlargest()
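The box plots and the `nlargest()` check above flag a few extreme values (e.g. `NumberOfTrips` counts of 19-22). A common rule of thumb for outlier detection is the 1.5×IQR fence; the sketch below, on synthetic data rather than the dataset itself, shows counting flagged points and capping (winsorizing) them at the fences:

```python
import pandas as pd

# Toy series with one extreme value, similar in spirit to NumberOfTrips
s = pd.Series([1, 2, 2, 3, 3, 3, 4, 5, 6, 22])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Points outside the fences are flagged as outliers
outliers = s[(s < lower) | (s > upper)]
print(len(outliers))

# Cap values at the fences instead of dropping rows
capped = s.clip(lower, upper)
```

Whether to treat such values depends on the column: a customer taking 20+ trips a year is implausible and likely a data error, while a long pitch duration may be genuine, so capping should be justified case by case.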
Count plots for each categorical feature
# cat_cols is defined above
for column in cat_cols:
    plt.figure(figsize=(20, 15))
    sns.countplot(x=column, data=data)
    plt.show()
# Swarm plots of Age across each feature (cat_cols defined above)
for column in cat_cols:
    sns.catplot(x=column, y="Age", kind="swarm", data=data, height=7, aspect=3)
# Bar plots of mean Age across each feature (cat_cols defined above)
for column in cat_cols:
    sns.catplot(x=column, y="Age", data=data, kind='bar', height=6, aspect=1.5, estimator=np.mean)
sns.pairplot(
data,
height=4,
aspect=1
);
sns.set(rc={'figure.figsize':(16,10)})
sns.heatmap(data.corr(numeric_only=True),
annot=True,
linewidths=.5,
center=0,
cbar=False,
cmap="YlGnBu")
plt.show()
# Separating features and the target column
# ProdTaken (whether the customer purchased a package) is the target; all other columns are features
X = data.drop(['ProdTaken'], axis=1)
y = data['ProdTaken']
# Encoding the categorical variables
X = pd.get_dummies(X, drop_first=True)
X.head()
# Splitting the data into train and test sets in a 70:30 ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1, shuffle=True)
X_train.shape, X_test.shape
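`pd.get_dummies(..., drop_first=True)` drops one level per categorical variable so the remaining dummies are not perfectly collinear (the dropped level becomes the baseline). A small self-contained illustration on toy data:

```python
import pandas as pd

toy = pd.DataFrame({"Gender": ["Male", "Female", "Male"]})

d_full = pd.get_dummies(toy)                    # one dummy column per level
d_base = pd.get_dummies(toy, drop_first=True)   # first level (Female) becomes the baseline

print(list(d_full.columns))   # ['Gender_Female', 'Gender_Male']
print(list(d_base.columns))   # ['Gender_Male']
```

For tree-based models the collinearity is harmless, but dropping the first level keeps the feature matrix smaller and is the safer default if linear models are tried later.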
We will use the R² score to optimize the models. The coefficient of determination is used to evaluate the performance of a regression model: it is the proportion of the variation in the dependent variable that is predictable from the independent variables.
Function to calculate the R² score and RMSE on the train and test data
def get_model_score(model, flag=True):
    '''
    model : estimator used to predict values of X
    '''
    # defining an empty list to store train and test results
    score_list = []
    pred_train = model.predict(X_train)
    pred_test = model.predict(X_test)
    train_r2 = metrics.r2_score(y_train, pred_train)
    test_r2 = metrics.r2_score(y_test, pred_test)
    train_rmse = np.sqrt(metrics.mean_squared_error(y_train, pred_train))
    test_rmse = np.sqrt(metrics.mean_squared_error(y_test, pred_test))
    # adding all scores to the list
    score_list.extend((train_r2, test_r2, train_rmse, test_rmse))
    # the print statements below are only displayed when flag is True (the default)
    if flag:
        print("R-square on training set : ", train_r2)
        print("R-square on test set : ", test_r2)
        print("RMSE on training set : ", train_rmse)
        print("RMSE on test set : ", test_rmse)
    # returning the list with train and test scores
    return score_list
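As a quick sanity check on the metrics used in get_model_score, R² and RMSE can be computed by hand on a toy example (the values below are chosen arbitrarily for illustration):

```python
import numpy as np
from sklearn import metrics

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y_true - y_pred) ** 2)                 # 0.10
ss_tot = np.sum((y_true - y_true.mean()) ** 2)          # 5.0
r2_manual = 1 - ss_res / ss_tot                         # 0.98

# RMSE is the square root of the mean squared error
rmse = np.sqrt(metrics.mean_squared_error(y_true, y_pred))

print(r2_manual, rmse)
```

An R² of 1 means the predictions explain all the variation; a model that always predicts the mean scores 0, and worse models can go negative on the test set.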
dtree=DecisionTreeRegressor(random_state=1)
dtree.fit(X_train,y_train)
dtree_score=get_model_score(dtree)
# Choose the type of estimator
dtree_tuned = DecisionTreeRegressor(random_state=1)
# Grid of parameters to choose from
parameters = {'max_depth': list(np.arange(2,20)) + [None],
'min_samples_leaf': [1, 3, 5, 7, 10],
'max_leaf_nodes' : [2, 3, 5, 10, 15] + [None],
'min_impurity_decrease': [0.001, 0.01, 0.1, 0.0]
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.r2_score)
# Run the grid search
grid_obj = GridSearchCV(dtree_tuned, parameters, scoring=scorer,cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the variable to the best combination of parameters found by the grid search
dtree_tuned = grid_obj.best_estimator_
# Fit the best algorithm to the data.
dtree_tuned.fit(X_train, y_train)
dtree_tuned_score=get_model_score(dtree_tuned)
Plotting the feature importance of each variable
# Importance of a feature is computed as the (normalized) total reduction of the
# criterion brought by that feature; this is also known as the Gini importance.
print(pd.DataFrame(dtree_tuned.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False))
feature_names = X_train.columns
importances = dtree_tuned.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
rf_estimator=RandomForestRegressor(random_state=1)
rf_estimator.fit(X_train,y_train)
rf_estimator_score=get_model_score(rf_estimator)
# Choose the type of estimator
rf_tuned = RandomForestRegressor(random_state=1)
# Grid of parameters to choose from
parameters = {
'max_depth':[4, 6, 8, 10, None],
'max_features': ['sqrt','log2',None],
'n_estimators': [80, 90, 100, 110, 120]
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.r2_score)
# Run the grid search
grid_obj = GridSearchCV(rf_tuned, parameters, scoring=scorer,cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the variable to the best combination of parameters found by the grid search
rf_tuned = grid_obj.best_estimator_
# Fit the best algorithm to the data.
rf_tuned.fit(X_train, y_train)
rf_tuned_score=get_model_score(rf_tuned)
# Importance of a feature is computed as the (normalized) total reduction of the
# criterion brought by that feature; this is also known as the Gini importance.
print(pd.DataFrame(rf_tuned.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False))
feature_names = X_train.columns
importances = rf_tuned.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
# AdaBoost Regressor with default parameters, trained and scored on the same split as the other models
ab_regressor = AdaBoostRegressor(random_state=1)
ab_regressor.fit(X_train, y_train)
ab_regressor_score = get_model_score(ab_regressor)
# Choose the type of estimator
ab_tuned = AdaBoostRegressor(random_state=1)
# Grid of parameters to choose from
parameters = {'n_estimators': np.arange(10,100,10),
'learning_rate': [1, 0.1, 0.5, 0.01],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.r2_score)
# Run the grid search
grid_obj = GridSearchCV(ab_tuned, parameters, scoring=scorer,cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the variable to the best combination of parameters found by the grid search
ab_tuned = grid_obj.best_estimator_
# Fit the best algorithm to the data.
ab_tuned.fit(X_train, y_train)
ab_tuned_score=get_model_score(ab_tuned)
# importance of features in the tree building
print(pd.DataFrame(ab_tuned.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False))
feature_names = X_train.columns
importances = ab_tuned.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
gb_estimator = GradientBoostingRegressor(random_state=1)
gb_estimator.fit(X_train, y_train)
gb_estimator_score = get_model_score(gb_estimator)
# Choose the type of estimator
gb_tuned = GradientBoostingRegressor(random_state=1)
# Grid of parameters to choose from
parameters = {'n_estimators': np.arange(50,200,25),
'subsample':[0.7,0.8,0.9,1],
'max_features':[0.7,0.8,0.9,1],
'max_depth':[3,5,7,10]
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.r2_score)
# Run the grid search
grid_obj = GridSearchCV(gb_tuned, parameters, scoring=scorer,cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the variable to the best combination of parameters found by the grid search
gb_tuned = grid_obj.best_estimator_
# Fit the best algorithm to the data.
gb_tuned.fit(X_train, y_train)
gb_tuned_score=get_model_score(gb_tuned)
# Importance of a feature is computed as the (normalized) total reduction of the
# criterion brought by that feature; this is also known as the Gini importance.
print(pd.DataFrame(gb_tuned.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False))
feature_names = X_train.columns
importances = gb_tuned.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
xgb_estimator=XGBRegressor(random_state=1)
xgb_estimator.fit(X_train,y_train)
xgb_estimator_score=get_model_score(xgb_estimator)
# Choose the type of estimator
xgb_tuned = XGBRegressor(random_state=1)
# Grid of parameters to choose from
parameters = {'n_estimators': [75,100,125,150],
'subsample':[0.7, 0.8, 0.9, 1],
'gamma':[0, 1, 3, 5],
'colsample_bytree':[0.7, 0.8, 0.9, 1],
'colsample_bylevel':[0.7, 0.8, 0.9, 1]
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.r2_score)
# Run the grid search
grid_obj = GridSearchCV(xgb_tuned, parameters, scoring=scorer,cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the variable to the best combination of parameters found by the grid search
xgb_tuned = grid_obj.best_estimator_
# Fit the best algorithm to the data.
xgb_tuned.fit(X_train, y_train)
xgb_tuned_score=get_model_score(xgb_tuned)
# Importance of a feature is computed as the (normalized) total reduction of the
# criterion brought by that feature; this is also known as the Gini importance.
print(pd.DataFrame(xgb_tuned.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False))
feature_names = X_train.columns
importances = xgb_tuned.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
Now, let's build a stacking model with the tuned models (decision tree, random forest, and gradient boosting) as base estimators, using XGBoost as the final estimator to produce the prediction.
estimators=[('Decision Tree', dtree_tuned),('Random Forest', rf_tuned),
('Gradient Boosting', gb_tuned)]
final_estimator=XGBRegressor(random_state=1)
stacking_estimator=StackingRegressor(estimators=estimators, final_estimator=final_estimator,cv=5)
stacking_estimator.fit(X_train,y_train)
stacking_estimator_score=get_model_score(stacking_estimator)
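For intuition on what StackingRegressor does: the base estimators produce out-of-fold predictions (controlled by `cv`), and the final estimator is trained on those predictions. A minimal self-contained illustration on synthetic data (not the tourism dataset; the estimator choices here are arbitrary):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data for illustration only
X_toy, y_toy = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=1)

stack = StackingRegressor(
    estimators=[
        ("tree", DecisionTreeRegressor(max_depth=5, random_state=1)),
        ("rf", RandomForestRegressor(n_estimators=50, random_state=1)),
    ],
    final_estimator=LinearRegression(),  # meta-model trained on base predictions
    cv=5,  # base models contribute out-of-fold predictions to the meta-model
)
stack.fit(X_tr, y_tr)
print(stack.score(X_te, y_te))  # R^2 on held-out data
```

The cross-validated predictions prevent the meta-model from simply memorizing the base models' training-set fit, which is why stacking can outperform any single base estimator.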
# defining list of models
models = [dtree, dtree_tuned, rf_estimator, rf_tuned, ab_regressor, ab_tuned, gb_estimator, gb_tuned, xgb_estimator,
xgb_tuned, stacking_estimator]
# defining empty lists to add train and test results
r2_train = []
r2_test = []
rmse_train= []
rmse_test= []
# looping through all the models to get the RMSE and R2 scores
for model in models:
    j = get_model_score(model, False)
    r2_train.append(j[0])
    r2_test.append(j[1])
    rmse_train.append(j[2])
    rmse_test.append(j[3])
comparison_frame = pd.DataFrame({'Model':['Decision Tree','Tuned Decision Tree','Random Forest','Tuned Random Forest',
'AdaBoost Regressor', 'Tuned AdaBoost Regressor',
'Gradient Boosting Regressor', 'Tuned Gradient Boosting Regressor',
'XGBoost Regressor', 'Tuned XGBoost Regressor','Stacking Regressor'],
'Train_r2': r2_train,'Test_r2': r2_test,
'Train_RMSE':rmse_train,'Test_RMSE':rmse_test})
comparison_frame
# Plot observed vs predicted values on the test data for the best model, i.e. the tuned gradient boosting model
fig, ax = plt.subplots(figsize=(8, 6))
y_pred=gb_tuned.predict(X_test)
ax.scatter(y_test, y_pred, edgecolors=(0, 0, 1))
ax.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', lw=3)
ax.set_xlabel('Observed')
ax.set_ylabel('Predicted')
ax.set_title("Observed vs Predicted")
plt.grid()
plt.show()